{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualization in Spatial Data Analysis\n",
    "\n",
    "Having covered the formal representation of spatial data we now turn to a series\n",
    "of methods for the _visualization_ of spatial data. We continue on with the\n",
    "airbnb data that we explored in the previous notebook. Here we use it to\n",
    "describe a common type of geovizualization used for area unit data with numeric\n",
    "attributes, namely _choropleth maps_.\n",
    "\n",
    "## Choropleth Maps ##\n",
    "\n",
    "Choropleth maps play a prominent role in spatial data science.\n",
    "The word choropleth stems from the root \"choro\" meaning \"region\". As such\n",
    "choropleth maps are appropriate for areal unit data where each observation\n",
    "combines a value of an attribute and a polygon. Choropleth maps derive from an\n",
    "earlier era where cartographers faced technological constraints that precluded\n",
    "the use of unclassed maps where each unique attribute value could be represented\n",
    "by a distinct symbol. Instead, attribute values were grouped into a smaller\n",
    "number of classes with each class being associated with a unique symbol that was\n",
    "in turn applied to all polygons with attribute values falling in the class.\n",
    "\n",
    "Although today these technological constraints are no longer binding, and\n",
    "unclassed mapping is feasible, there are still good reasons for adopting a\n",
    "classed approach. Chief among these is to reduce the cognitive load involved in\n",
    "parsing the complexity of an unclassed map. A choropleth map reduces this\n",
    "complexity by drawing upon statistical and visualization theory to provide an\n",
    "effective representation of the spatial distribution of the attribute values\n",
    "across the areal units. \n",
    "\n",
    "The effectiveness of a choropleth map will be a\n",
    "function of the choice of classification scheme together with the color or\n",
    "symbolization strategy adopted. In broad terms, the classification scheme\n",
    "defines the number of classes as well as the rules for assignment, while the\n",
    "symbolization should convey information about the value differentiation across\n",
    "the classes.\n",
    "\n",
    "## Data classification ##\n",
    "\n",
    "Data classification considers the problem of a\n",
    "partitioning of the attribute values into mutually exclusive and exhaustive\n",
    "groups. The precise manner in which this is done will be a function of the\n",
    "measurement scale of the attribute in question. For quantitative attributes\n",
    "(ordinal, interval, ratio scales) the classes will have an explicit ordering.\n",
    "More formally, the classification problem is to define class boundaries such\n",
    "that\n",
    "$$\n",
    "c_j < y_i \\le  c_{j+1} \\ \\forall y_i \\in C_{j+1}\n",
    "$$\n",
    "where $y_i$ is the\n",
    "value of the attribute for spatial location $i$, $j$ is a class index, and $c_j$\n",
    "represents the lower bound of interval $j$.\n",
    "\n",
    "Different classification schemes\n",
    "obtain from their definition of the class boundaries. The choice of the\n",
    "classification scheme should take into consideration the statistical\n",
    "distribution of the attribute values.\n",
    "\n",
    "\n",
    "### Attribute Distribution (a-spatial) ###"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "import libpysal.api as lp\n",
    "import matplotlib.pyplot as plt\n",
    "import rasterio as rio\n",
    "import numpy as np\n",
    "import contextily as ctx\n",
    "import shapely.geometry as geom\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = gpd.read_file('data/neighborhoods.shp')\n",
    "# was created in previous notebook with df.to_file('data/neighborhoods.shp')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have an `nan` to first deal with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.isnull(df['median_pri']).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df\n",
    "df['median_pri'].fillna((df['median_pri'].mean()), inplace=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sbn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `seaborn` to visualize the statistical distribution of the median\n",
    "price of listings across neighborhoods:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "sbn.distplot(df['median_pri'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `distplot` combines a histogram with a kernel density. Both reflect a\n",
    "right-skewed distribution, not uncommon for housing rents, or urban income\n",
    "distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sbn.distplot(df['median_pri'], rug=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adding the `rug` argument provides additional insight as to the distribution of\n",
    "specific observations across the price range.\n",
    "\n",
    "The histogram and density give us a sense of the \"value\" distribution. From a\n",
    "spatial data science perspective, we are also interested in the \"spatial\"\n",
    "distribution of these values. \n",
    "\n",
    "\n",
    "### Spatial Distribution - Geovisualization Revisited ###\n",
    "\n",
    "Since we have a GeoDataFrame we can call the `plot` method to generate a default\n",
    "choropleth:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(column='median_pri')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Without prior knowledge of this urban area, interpretation of the visualization\n",
    "is challenging. We can improve things through some additional options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})\n",
    "df.plot(column='median_pri', scheme='Quantiles', ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've bumped up the size of the map and added an option for `scheme`.\n",
    "The `Quantiles` scheme is one map classification method that GeoPanda includes,\n",
    "which are in turn based on PySAL.\n",
    "\n",
    "This has improved the visualization, but there are still questions. For example,\n",
    "how many different colors (hues) do we see? A choropleth map uses different hues\n",
    "to distinguish different value classes, in a similar sense to the bins of a\n",
    "histogram separating observations with different values.\n",
    "\n",
    "\n",
    "Adding a `legend` argument reveals the nature of the classes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})\n",
    "df.plot(column='median_pri', scheme='Quantiles',  legend=True, ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The five classes in the legend are the quintiles for the attribute distribution.\n",
    "If we instead would like to see the quartile classification, we set a new option `k=4`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})\n",
    "df.plot(column='median_pri', scheme='Quantiles', k=4, legend=True, ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to changing the number of classes, the color map that defines the\n",
    "hues for the classes can be changed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})\n",
    "df.plot(column='median_pri', scheme='Quantiles', k=4, legend=True, ax=ax, \n",
    "        cmap='OrRd')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the color ramp and the classification scheme can have substantial impacts\n",
    "on the resulting visualization. GeoPandas makes it straightforward to explore\n",
    "these dimensions. For example, changing the color ramp results in:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f,ax = plt.subplots(1,3,figsize=(3.16*4,4))\n",
    "df.plot(column='median_pri', ax=ax[0], edgecolor='k',\n",
    "        scheme=\"quantiles\",  k=4)\n",
    "ax[0].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[0].set_title(\"Default\")\n",
    "df.plot(column='median_pri', ax=ax[1], edgecolor='k',\n",
    "        scheme='quantiles', cmap='OrRd', k=4)\n",
    "ax[1].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[1].set_title(\"OrRd\")\n",
    "df.plot(column='median_pri', ax=ax[2], edgecolor='k',\n",
    "        scheme='quantiles', cmap='GnBu', k=4)\n",
    "ax[2].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[2].set_title(\"GnBu\")\n",
    "ax[0].axis('off')\n",
    "ax[1].axis('off')\n",
    "ax[2].axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively we can change the classification scheme:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f,ax = plt.subplots(1,3,figsize=(3.16*4,4))\n",
    "df.plot(column='median_pri', ax=ax[0], edgecolor='k',\n",
    "        scheme=\"quantiles\", cmap='OrRd', k=4)\n",
    "ax[0].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[0].set_title(\"Quartiles\")\n",
    "df.plot(column='median_pri', ax=ax[1], edgecolor='k',\n",
    "        scheme='equal_interval', cmap='OrRd', k=4)\n",
    "ax[1].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[1].set_title(\"Equal Interval\")\n",
    "df.plot(column='median_pri', ax=ax[2], edgecolor='k',\n",
    "       scheme='fisher_jenks', cmap='OrRd', k=4)\n",
    "ax[2].axis(df.total_bounds[np.asarray([0,2,1,3])])\n",
    "ax[2].set_title(\"Fisher Jenks\")\n",
    "ax[0].axis('off')\n",
    "ax[1].axis('off')\n",
    "ax[2].axis('off')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GeoPandas supports a subset of the map classification schemes that are available\n",
    "in PySAL. The selected ones have a similar function signature that allowed for a\n",
    "simple soft dependency. Since the initial implementation, PySAL has begun a\n",
    "major refactoring, and as part of this the classification scheme have been moved\n",
    "into their own package: `mapclassify`. We can explore some of the other\n",
    "classifiers within this package:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mapclassify.api as mc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`mapclassify` offers a number of schemes for choropleth mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mc.CLASSIFIERS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Equal intervals splits the data range into $k$ equal-width bins:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = df['median_pri']\n",
    "ea5 = mc.Equal_Interval(y, k=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This creates a classification object using equal intervals for the\n",
    "classification:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ea5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and, among its attributes are the bin ids each observation is classified into:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ea5.yb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quantiles, which we also saw from within the GeoPandas df, has a similar\n",
    "interface:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q5 = mc.Quantiles(y, k=5)\n",
    "q5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q5.yb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While equal-width bins have the same width, quantiles attempt to have the same\n",
    "number of observations assigned to each of the $k$ classes. A close look at the\n",
    "result for the quintiles reveals this is not exact for the price data. The\n",
    "reason for this is that there are substantially less than $n$ unique\n",
    "observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.unique(y).shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So care must be taken with regard to quantiles.\n",
    "\n",
    "### Selecting a classification based on fit ###\n",
    "\n",
    "Choropleth mapping can be used for different purposes. The most common is to\n",
    "select a classification that provides a balance between maximizing the\n",
    "differences between observations in each bin, and minimizing the intra-bin heterogeneity.\n",
    "\n",
    "The classifiers in PySAL have underlying measures of fit for this purpose, but\n",
    "it is important to keep in mind these should only be used for classifiers with\n",
    "the same number of classes. One \n",
    "\n",
    "One such measure is the the `absolute deviation around class\n",
    "medians` (ADCM). Let's use this to compare all k=5 classifiers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q5 = mc.Quantiles(y, k=5)\n",
    "ei5 = mc.Equal_Interval(y, k=5)\n",
    "mb5 = mc.Maximum_Breaks(y, k=5)\n",
    "fj5 = mc.Fisher_Jenks(y, k=5)\n",
    "fits = [c.adcm for c in [q5, ei5, mb5, fj5]]\n",
    "fits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, Fisher Jenks is the optimal classifer. \n",
    "\n",
    "\n",
    "### Outlier Detection ###\n",
    "Another application of choropleth mapping is to identify either data errors\n",
    "or extreme values. There are a number of classifers that can be used to detect _value outliers_:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ht = mc.HeadTail_Breaks(y)\n",
    "ht"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "std = mc.Std_Mean(y)\n",
    "std"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bp = mc.Box_Plot(y)\n",
    "bp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using mapclassify with GeoPandas ###\n",
    "\n",
    "Although only a subset of the PySAL classifiers are directly accessible from\n",
    "within GeoPandas, it is possible to combine external classifiers with GeoPandas:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = ['0-low outlier', '1-low whisker',\n",
    "          '2-Q2', '3-Q3', '4-high whisker', '5-high outlier']\n",
    "bpl = [ labels[b] for b in bp.yb ]\n",
    "\n",
    "f, ax = plt.subplots(1, figsize=(9, 9))\n",
    "df.assign(cl=bpl).plot(column='cl', categorical=True, \\\n",
    "                                      k=4, cmap='OrRd', linewidth=0.1, ax=ax,\\\n",
    "                                      edgecolor='grey', legend=True)\n",
    "ax.set_axis_off()\n",
    "plt.show()\n"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}
